ABSTRACT


Sample Entropy examines changes in the normal 
distribution of network traffic to identify anomalies. 
Normalized Information examines the overall probability 
distribution in a data set. Random Forests is a supervised 
learning algorithm which is efficient at classifying 
highly-imbalanced data. Anomalies are exceedingly rare 
compared to the overall volume of network traffic. The 
combination of these methods enables low-bandwidth 
anomalies to easily be identified in high-bandwidth network 
traffic. Using only low-dimensional network information
allows for near real-time identification of anomalies. The
data set was drawn from the 1999 DARPA intrusion detection
evaluation data set. The experiments compare a baseline
f-score to the f-score obtained from the observed entropy
and normalized information of the network. Anomalies that
are disguised in network flow analysis were detected.
Random Forests prove to be capable of classifying anomalies
using the sample entropy and normalized information. Our
experiment divided the data set into five-minute time
slices and found that sample entropy and normalized
information metrics were successful in classifying bad
traffic with a recall of .99 and an f-score of .50, which
was 185% better than our baseline.
I. INTRODUCTION


A. PROLOGUE 

In this chapter we introduce the reader to a new approach
to classifying low-volume anomalies in network traffic. We
first give background on Sample Entropy, Normalized
Information, and Random Forests. We then provide a brief
description of our hypothesis and of how we tested it using
the Random Forest algorithm. We conclude with a synopsis of
the remaining chapters of this thesis.

B. BACKGROUND 

Sample Entropy examines changes in the normal 
distribution of network traffic to identify anomalies. 
Normalized Information examines the overall probability 
distribution in a data set. Random Forests is a supervised 
learning algorithm which is efficient at classifying 
highly-imbalanced data. Anomalies are exceedingly rare 
compared to the overall volume of network traffic. 

C. HYPOTHESIS 

Our hypothesis is that the combination of Sample
Entropy and Normalized Information will enable
low-bandwidth anomalies to be identified in high-bandwidth
network traffic. We anticipate that using only
low-dimensional network information may in the future allow
for near real-time identification of anomalies. The
hypothesis was tested against the 1999 DARPA intrusion
detection evaluation data set. The experiments compared a
baseline f-score to the f-score obtained from the observed
entropy and normalized information of the network.
The Random Forest algorithm is a supervised
machine-learning algorithm which has proven capable of
classifying highly-imbalanced data sets, but it has not
been studied thoroughly in the field of intrusion
detection. Our experiments address whether sample entropy,
normalized information, or their combination, processed by
a Random Forest algorithm, can identify low-bandwidth
anomalies, and to what degree. These anomalies often avoid
detection by standard anomaly-based intrusion detection
systems.

D. ORGANIZATION OF THIS DOCUMENT 

The remainder of this thesis is organized as follows.
Chapter II discusses intrusion detection systems, a few
common anomalies, different machine-learning algorithms,
and StealthWatch, an anomaly-based intrusion detection
system. Chapter III describes the design of the experiment
and the gathering of the data set needed to test our
hypothesis. Chapter IV analyzes and discusses the results
of the experiments. Chapter V offers conclusions and
recommendations for future work, and the Appendix contains
the code used to transform the original data into a
suitable format for the experiments.
II. BACKGROUND


A. INTRUSION DETECTION 

An intrusion to a computer or computer network is
defined by [Canavan. 2001] as "an unauthorized attempt or
achievement to access, alter, render unavailable, or
destroy information on a system or the system itself."
Network administrators were previously able to review
various logs on a daily basis to check for intrusion
attempts. However, given the growth of the Internet and the
volume of traffic now being generated on a network, waiting
for daily checks is too late. Another approach is required.

Intrusion Detection Systems (IDSs) began as research
projects for the US government in the early 1980s. In 1980,
James Anderson published the first paper, in which he
describes an effort to improve the computer security
auditing and surveillance of a network. In his paper
[Anderson. 1980], the threat was broken into four
categories:

1. External Penetration 

An individual from outside the organization attempting 
to gain access to computer network resources; also an 
employee who has physical access but is not an authorized 
computer user. 

2. Internal Penetration 

Anderson breaks this type of penetration into three
subgroups. He claims that this variant of threat is more
prevalent than an external threat.

a. Masquerader

This is a user who has gained a proper user
identification and a corresponding password. Locating
this type of user can be attempted by looking at audit
records for deviations from normal activity for a given
user.

b. Legitimate User 

This is a user who misuses authorized access to the
computer network. This might be revealed in an audit log if
the user is accessing data for which they do not have
authorization. Once again, a normal profile of user
activity on the computer system is required to locate
anomalies.

c. Clandestine User 

This is a user who can obtain administrator
control of a computer and delete or alter the audit trail.
Here, having a reference model of the operating system with
which to compare the current state of the machine is key to
detection. Storing audit records in a central location, not
on the local machine, is another approach which makes
hiding the activity of a clandestine user much more
difficult. [Anderson. 1980]

B. TYPES OF INTRUSION DETECTION SYSTEMS 

Unlike a firewall, intrusion detection systems do not 
block unauthorized packets based on a rule set. An IDS 
instead analyzes the packet header and packet content and 
makes a determination of legitimacy. If a packet is deemed 
malicious an alert is generated, allowing a system 
administrator to examine the packet. 

Intrusion Detection Systems come in two basic types: 
host-based and network-based. The following two sections 
describe these two types in general terms. 

1. Network-based Intrusion Detection 

Network-based intrusion detection systems (NIDS)
analyze network packets, compare packet structure to known
malware patterns, search an internal rule set, then make a
determination of misuse, and if necessary generate an alarm
which is reported to a centralized location. A NIDS has
several advantages. It is effective at detecting outsiders
attempting to penetrate the network; one or at most a few
sensors are all that is required to provide coverage for
the entire network; and since the NIDS is listening for all
network traffic, it is positioned to detect attacks
directed at any host on the network. A NIDS, if configured
appropriately, also has the potential to stop an attack
prior to reaching the hosts. A NIDS generally runs on a
specially built machine so it does not degrade the
computing resources of individual systems. NIDS use two
approaches to classify an intrusion.

a. Signature-based IDS 

This system matches traffic against a known signature
or pattern that was generated by the IDS vendor. The rule
set is stored locally in every instance of the IDS. It is
also commonly referred to as rule-based intrusion detection
(RBID). When a new attack pattern has been observed, it is
analyzed and a new rule is generated by the vendor. The
vendor notifies customers that an "update" is available so
their instance can have the most current rule set. Most
vendors sell systems which can be configured to
automatically check with the vendor for updates and
automatically install them. This method ensures the IDS
has the vendor's most current rule set without a network
administrator having to spend the time to check on a daily
basis.

b. Anomaly-based IDS

In 1987, Denning [Denning. 1987] described a
model for a real-time intrusion detection system which
built on Anderson's work. Her hypothesis was that security
violations in the network could be discovered by searching
for abnormal patterns of use in the system. The model
creates a profile based on statistical metrics for every
subject in the system and compares the baseline profile to
current activity, searching for deviations from the
profile. Anomaly-detection is defined by Bace as "using
statistical techniques to find patterns of activity that
appear to be abnormal." [Bace. 2001] These patterns of
activity are evaluated for possible signs of malicious
activity. While these systems have great potential to
defend against new or unknown attacks, determining what
traffic is abnormal is still a great challenge. In 2000,
Lancope Corporation released StealthWatch, one of the first
anomaly-based IDSs. [Lancope. 2006]
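
The profile-and-deviation idea can be sketched in a few lines.
The sketch below is only an illustration: it assumes a single
numeric metric per subject (for example, logins per hour) and
uses a mean and standard deviation test with a hypothetical
threshold of three standard deviations. Neither the metric nor
the threshold is taken from the text above.

```python
import statistics

def is_anomalous(baseline, current, threshold=3.0):
    """Flag `current` if it deviates from the subject's baseline profile.

    baseline  -- past observations of one metric for one subject (assumed)
    threshold -- allowed distance in standard deviations (hypothetical)
    """
    mean = statistics.mean(baseline)
    std = statistics.pstdev(baseline)
    if std == 0:
        # A perfectly constant profile: any change is a deviation.
        return current != mean
    return abs(current - mean) / std > threshold

# Hypothetical profile: logins per hour observed for one user.
profile = [10, 12, 11, 9, 10, 11, 10, 12]
print(is_anomalous(profile, 11))   # within the normal range
print(is_anomalous(profile, 40))   # far outside the profile
```

The difficulty noted above shows up directly in the choice of
threshold: too low and normal variation raises alarms, too high
and subtle misuse passes unnoticed.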

2. Host-based Intrusion Detection Systems (HIDS)

A host-based IDS is designed to monitor and analyze
data that originates on the individual system on which it
is installed. HIDS are particularly effective at detecting
misuse of the system by an authorized user. [Proctor. 2001]
HIDS have several different sources of data available on
the host: system logs, audit logs, listings of active
processes, keystroke monitoring, and packet throughput.

There are several advantages to HIDS including: 

• Actual results of an attack or user misuse of 
system are available. 

• Less reliance on a set of rules. 

• Higher likelihood of detecting an unknown attack. 

• Insiders knowing they are being monitored are 
less likely to misuse the system. 

• No additional hardware requirement. 

• Encrypted network traffic is accessible for 
analyzing. 

HIDS by themselves have numerous disadvantages. The
most critical is that a HIDS can have a significant
performance impact on the host. If the host is compromised,
the system logs, if stored locally, are subject to
manipulation. Some attacks, like a buffer overflow, are not
likely to be logged. Finally, if the system is compromised,
the monitoring can be terminated or nullified. [Crothers.
2003] Therefore, good-quality audit sources are critical.
These logs should be created by a trusted source, be of
sufficient detail to recreate every event, and be stored
off-host to protect their integrity. [Proctor. 2001]

Both network-based IDS and HIDS have advantages and
drawbacks against specific attack methods, but together
they create a much more effective network defense than
either alone. Some examples of malicious activity likely to
be found in a current network, which NIDS and HIDS can help
discover and prevent, are presented in the next section.

C. MALICIOUS ACTIVITY 

Malicious activity can be defined as an intentional
attempt to bypass computer security measures in some
fashion. [Crothers. 2003] Users may attempt to download
music files from a common peer-to-peer file-sharing system
like KaZaA in violation of company policy. They may
install an Internet shareware game on their computer which
has a network scanner embedded inside of it. A user could
open an attachment from an unknown sender asking the user
to "click", which, while displaying a funny video, enables
a worm to be loaded into the local system. In the following
sections, worms, peer-to-peer software, and network
scanners are described.
1. Worm


A worm is a self-replicating computer program that is
self-contained and does not need any other software program
to replicate. The name "worm" comes from The Shockwave
Rider, a science fiction novel written by John Brunner in
1975. The first worm on a worldwide network was the
Christmas Tree Worm released in December 1987, which spread
across IBM's network and BITNET. [Erbschloe. 2005] The
power of the worm was such that both networks were severely
affected. A worm has four primary qualities: a propagation
mechanism, an executable piece of code that it transports,
a way to identify additional machines vulnerable to the
worm, and various means of attempting to avoid detection.
This combination of attributes makes a worm appealing to
malicious users.

2. Peer-to-Peer 

These are programs that allow you to connect to other
users to share files; to send other users instant messages
containing text, voice messages, or files; and to conduct
distributed processing, which utilizes the unused computing
power on your local computer to create huge computing
capacity. They also allow you to create a network to
upload and download material; this is often music, video,
and games. This ability to upload and download material is
of great concern to network security personnel. Files that
are downloaded can contain additional content; this content
can be spyware, viruses, Trojan horses, or worms. Once the
file is downloaded, the system can then be exploited and
serve as a zombie or malware server to spread malicious
code inside the local area network. These applications
also allow others to gain access and place files on your
local machine without your knowledge. Continuously
scanning the network for any signs of peer-to-peer activity
can help eliminate a common attack vector for malware. One
of the more dangerous types of malware now exploiting
peer-to-peer systems is the worm. I will give an example of
one recently identified worm in the next section.

3. Service Discovery 

This is an attempt by an unauthorized user or piece of
software to discover what applications or computers exist
in your network. This informs the attacker which computers
are turned on and on which ports they are listening for
network traffic. Service discovery is utilized by all
levels of hackers. SuperScan by FoundStone and Nmap by
Insecure are two popular tools for service discovery.
Network security personnel should be concerned if this type
of activity is detected on the network. Scanning is an
indicator that service discovery is taking place or has
taken place, and the attacker can now craft an exploit
specifically designed for vulnerabilities found in your
network.

D. MACHINE-LEARNING ALGORITHMS 

There are several machine-learning algorithms that 
have been created to attempt to find patterns and anomalies 
in data sets. The following sections will describe a few 
of them in detail. 

1. CART 

The classification and regression tree (CART) is a
general framework in which a given set of data is broken
into smaller subsets determined by category labels. Each
split is designed to select the best label to split upon,
with the goal of creating a subset of data with the exact
same categorical values. Data subsets that are not pure are
called nodes, and splitting continues until a subset
ideally contains only the same categorical values. This
subset is deemed "pure" and splitting is halted on the
given subset of data. The subset is also classified as a
leaf of the tree. In highly variable data, a floor can be
set on the splitting function based on the number of
observations in a node; this prevents an expanse of leaves
with only one observation. [Duda. 2001]
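
The splitting procedure can be sketched as follows. This is a
minimal illustration, not the CART implementation itself: it
assumes numeric features, uses Gini impurity as the purity
measure (one common choice, not named in the text above), and
the names gini, best_split, grow, predict and min_node_size
are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a subset; 0.0 means the subset is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels, min_node_size=2):
    """Split until a subset is pure or smaller than the floor."""
    split = None
    if len(set(labels)) > 1 and len(labels) >= min_node_size:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority label
    f, t = split
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow([r for r, _ in lo], [y for _, y in lo], min_node_size),
            grow([r for r, _ in hi], [y for _, y in hi], min_node_size))

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree

rows = [[1.0], [2.0], [8.0], [9.0]]
labels = ["normal", "normal", "attack", "attack"]
tree = grow(rows, labels)
print(predict(tree, [1.5]), predict(tree, [8.5]))
```

Raising min_node_size plays the role of the floor described
above: nodes with fewer observations are left as leaves even
if they are not pure.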

2. K-MEANS 

The K-means algorithm was introduced in 1982. [Lloyd.
1982] It remains quite popular due to its simplicity and
speed. The K-means procedure works as follows. Given a set
of n data points, partition the data points into k
clusters based on local search. A random set of k initial
cluster centers is chosen. Each point is assigned to the
closest cluster center, as measured by the Euclidean
distance over its features. The centers of the clusters are
recomputed based on the new set of data points in the given
cluster. The procedure is repeated until every point is
assigned to the cluster that minimizes its Euclidean
distance. The clusters with their data points are then
returned from the procedure. [Arthur, et al. 2006]
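
The steps above can be sketched as follows. This is a minimal
illustration under stated assumptions: points are tuples of
numbers, the initial centers are sampled from the data with a
fixed seed for repeatability, and the function name kmeans and
its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after iterating assign/recompute steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # random initial cluster centers
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:   # no point changed cluster: done
            break
        assignment = new_assignment
        # Recompute each center as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(assignment)
```

Because the starting centers are random, different seeds can
label the clusters differently or, on harder data, converge to
different local optima.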

3. Hierarchical Agglomerative Algorithm 

The hierarchical agglomerative procedure clusters data
points as follows. Given n data points, assign each to its
own cluster. The procedure then searches the space for the
two clusters having minimum Euclidean distance between
their vectors and merges them into one cluster. The search
and merge steps repeat until all clusters have been joined
into one cluster containing all data points.
Alternatively, you can force a floor on the number of
clusters. [Lakhina, et al. 2005]
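
A minimal sketch of this procedure follows. The text does not
say which vectors represent a cluster when distances are
compared, so the sketch assumes the cluster centroid (centroid
linkage); the names centroid, agglomerate and min_clusters are
hypothetical.

```python
import math

def centroid(cluster):
    """Mean vector of the points in a cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def agglomerate(points, min_clusters=1):
    """Merge the two closest clusters until the floor is reached."""
    clusters = [[p] for p in points]       # each point starts alone
    while len(clusters) > min_clusters:
        pairs = ((a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        # Pair of clusters whose centroids are closest together.
        i, j = min(pairs, key=lambda ab: math.dist(centroid(clusters[ab[0]]),
                                                   centroid(clusters[ab[1]])))
        merged = clusters.pop(j)           # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two = agglomerate(points, min_clusters=2)
print(two)
```

Calling agglomerate(points) with the default floor of one
returns a single cluster holding every point, which matches
the unconstrained case described above.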





4. Random Forests 

Classification accuracy has seen large improvements
from growing a number of trees and having them vote for the
most popular class. Random vectors are used to govern the
growth of individual trees. Breiman demonstrated an early
form of this method in 1996 with his introduction of
bagging. [Breiman. 1996] In bagging, each tree is grown
from a random selection of examples drawn from the training
set. Dietterich and Breiman continued to refine the
randomness in [Dietterich. 1998] and [Breiman. 1998]. Ho
proposed using "the random subspace" method to take a
random subset of features to grow individual trees. [Ho.
1998] Because Random Forests are used extensively in our
experiments, we describe them in more detail below.
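
The two sources of randomness mentioned above can be sketched
as follows. This is an illustration only: it assumes bagging
draws examples with replacement (a bootstrap sample the same
size as the training set), and the function names are
hypothetical.

```python
import random

def bootstrap_sample(rows, labels, rng):
    """Bagging: draw len(rows) examples at random, with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def random_subspace(num_features, k, rng):
    """Random subspace: pick k distinct feature indices for one tree."""
    return sorted(rng.sample(range(num_features), k))

rng = random.Random(0)
rows = [[1, 5], [2, 6], [3, 7], [4, 8]]
labels = ["normal", "normal", "attack", "attack"]
sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
features = random_subspace(num_features=4, k=2, rng=rng)
print(sample_rows, sample_labels, features)
```

Each tree would then be grown from its own sample and, under
the random subspace method, from its own subset of features.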

a. Formal Definition 

Random Forests were formally defined in 2001 as:

A classifier consisting of a collection of tree-structured
classifiers $\{h(\mathbf{x}, \Theta_k),\ k = 1, \ldots\}$
where the $\{\Theta_k\}$ are independent identically
distributed random vectors and each tree casts a unit vote
for the most popular class at input $\mathbf{x}$.
[Breiman. 2001]

The random vector is denoted $\Theta$. The nature and
dimensionality of $\Theta$ depend on its use in tree
construction.
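
The voting rule in this definition can be sketched as follows.
The sketch covers only the vote, not tree construction: the
three trees are hypothetical stand-ins written as simple
threshold rules, and forest_predict is an assumed name.

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree casts one unit vote; return the most popular class."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trees grown with different random vectors.
trees = [
    lambda x: "attack" if x[0] > 5 else "normal",
    lambda x: "attack" if x[1] > 3 else "normal",
    lambda x: "attack" if x[0] + x[1] > 12 else "normal",
]

print(forest_predict(trees, (6, 4)))   # two of three trees vote "attack"
print(forest_predict(trees, (6, 1)))   # two of three trees vote "normal"
```

An odd number of trees avoids ties in the two-class case; with
an even number a tie-breaking rule would be needed.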

b. Overfitting

Breiman proves with Theorem 1.2 in [Breiman.
2001] that, given a large number of trees, the Strong Law
of Large Numbers and the tree structure ensure that Random
Forests do not overfit as additional trees are added;
rather, the generalization error converges to a limiting
value.
5. Random Forests vs. Adaboost

Research into Random Forests explored various methods
to lower the generalization error. [Dietterich. 1998]
[Breiman. 1998, Freund, et al. 1996] [Bauer, et al. 1999]
Adaboost was the benchmark against which any implementation
of a Random Forest was compared. Breiman worked to improve
accuracy by injecting randomness to minimize the
correlation ρ while maintaining strength. Breiman's class
of random trees had five promising characteristics:

• Accuracy equal or better to Adaboost 

• Robust handling of outliers and noise 

• Faster than bagging or boosting 

• Provides internal estimates of error, strength,
correlation and variable importance

• Simple and easily parallelizable 

a. Empirical Experiments 

Breiman conducted several experiments using 16
data sets from the University of California Irvine
repository. Breiman compared two means of growing Random
Forests; in both, a random 10% of the data was set aside as
a test set. A Random Forest was grown to a size of 100
trees. Let F be the number of inputs to split on. The
experiments were run twice, once with F = 1 and once with F
given by equation (1.1), where M is the number of inputs.

$$F = \operatorname{int}(\log_2 M + 1) \tag{1.1}$$
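
As a worked example of equation (1.1), the short sketch below
evaluates F for a few values of M. The function name is
hypothetical, and int is taken to mean truncation toward zero,
as in Python.

```python
import math

def inputs_per_split(num_inputs):
    """F = int(log2(M) + 1), with M the number of inputs."""
    return int(math.log2(num_inputs) + 1)

for m in (1, 4, 8, 10):
    print(m, inputs_per_split(m))
```

For M = 4 this gives F = 3, and for M = 10 it gives F = 4, so F
grows slowly as inputs are added.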

Each method was run 100 times and the test-set errors
were averaged. For a fair comparison, the same procedure
was used to separate the data, and 50 trees were combined
for the Adaboost runs. Breiman's results showed that
Random Forests using random input selection (Forest-RI)
were comparable to Adaboost, with the added advantage of
being much faster. A Forest-RI run took four minutes to
execute where Adaboost took approximately three hours.

Breiman modified the random input concept by
defining more features, taking random linear combinations
of a subset of the input variables. This version of the
Random Forest was called Forest-RC. Forest-RC compared
more favorably to Adaboost than Forest-RI did.

b. Noise 

Additional experiments determined how sensitive
Random Forests were to mislabeled data, or "noise", when
compared to Adaboost. Adaboost showed a sharp decrease in
classification accuracy with 5% noise, while noise caused
only minor changes for both Random Forest procedures.

c. Conclusions 

Breiman demonstrated in 2001 that Random Forests
are an effective tool in prediction. Overfitting is not
an issue. Breiman's results demonstrated that Random
Forests are at least as accurate as other machine-learning
algorithms. Another advantage of Random Forests is that
the training set is not altered throughout the procedure,
as is the case with bagging and boosting. [Breiman. 2001]

E. CHOOSING A MACHINE-LEARNING ALGORITHM 

In 2006, Caruana and Niculescu-Mizil completed a
comprehensive empirical study of learning algorithms. This
was the first large-scale comparison since King conducted
the STATLOG study in 1995. They examined 10 supervised
learning methods and compared them using 8 different
performance metrics. The results are detailed in Table 1.


MODEL     CAL   COVT    ADULT   LTR.P1  LTR.P2  MEDIS   SLAC    HS      MG      CALHOUS COD     BACT    MEAN
BST-DT    PLT   .938    .857    .959    .970    .700    .869    .933    .855    .974    .915    .878*   .896*
RF        PLT   .876    .930    .897    .941    .810    .907*   .884    .883    .937    .903*   .847    .892
BAG-DT    -     .878    .944*   .883    .911    .762    .898*   .856    .898    .948    .856    .920    .887*
BST-DT    ISO   .922*   .865    .901*   .969    .692*   .878    .927    .845    .965    .912*   .861    .885*
RF        -     .876    .946*   .883    .922    .785    .912*   .871    .891*   .941    .874    .824    .884
BAG-DT    PLT   .873    .931    .877    .920    .752    .885    .863    .884    .944    .865    .912*   .882
RF        ISO   .865    .934    .851    .935    .767*   .920    .877    .876    .933    .897*   .821    .880
BAG-DT    ISO   .867    .933    .840    .915    .749    .897    .856    .884    .940    .859    .907*   .877
SVM       PLT   .765    .886    .936    .962    .733    .866    .913*   .816    .897    .900*   .807    .862
ANN       -     .764    .884    .913    .901    .791*   .881    .932*   .859    .923    .667    .882    .854
SVM       ISO   .758    .882    .899    .954    .693*   .878    .907    .827    .897    .900*   .778    .852
ANN       PLT   .766    .872    .898    .894    .775    .871    .929*   .846    .919    .665    .871    .846
ANN       ISO   .767    .882    .821    .891    .785*   .895    .926*   .841    .915    .672    .862    .842
BST-DT    -     .874    .842    .875    .913    .523    .807    .860    .785    .933    .835    .858    .828
KNN       PLT   .819    .785    .920    .937    .626    .777    .803    .844    .827    .774    .855    .815
KNN       -     .807    .780    .912    .936    .598    .800    .801    .853    .827    .748    .852    .810
KNN       ISO   .814    .784    .879    .935    .633    .791    .794    .832    .824    .777    .833    .809
BST-STMP  PLT   .644    .949    .767    .688    .723    .806    .800    .862    .923    .622    .915*   .791
SVM       -     .696    .819    .731    .860    .600    .859    .788    .776    .833    .864    .763    .781
BST-STMP  ISO   .639    .941    .700    .681    .711    .807    .793    .862    .912    .632    .902*   .780
BST-STMP  -     .605    .865    .540    .615    .624    .779    .683    .799    .817    .581    .906*   .710
DT        ISO   .671    .869    .729    .760    .424    .777    .622    .815    .832    .415    .884    .709
DT        -     .652    .872    .723    .763    .449    .769    .609    .829    .831    .389    .899*   .708
DT        PLT   .661    .863    .734    .756    .416    .779    .607    .822    .826    .407    .890*   .706
LR        -     .625    .886    .195    .448    .777*   .852    .675    .849    .838    .647    .905*   .700
LR        ISO   .616    .881    .229    .440    .763*   .834    .659    .827    .833    .636    .889*   .692
LR        PLT   .610    .870    .185    .446    .738    .835    .667    .823    .832    .633    .895    .685
NB        ISO   .574    .904    .674    .557    .709    .724    .205    .687    .758    .633    .770    .654
NB        PLT   .572    .892    .648    .561    .694    .732    .213    .690    .755    .632    .756    .650
NB        -     .552    .843    .534    .556    .011    .714    -.654   .655    .759    .636    .688    .481

Table 1   Normalized scores of each learning algorithm
by problem (averaged over eight metrics) (From Ref.
[Caruana, et al. 2006])

Uncalibrated Random Forests performed best on the
precision/recall break-even point and accuracy metrics, and
on three of the data sets. Calibration of a Random Forest
provided only a small improvement. [Caruana, et al. 2006]
In 2004, Random Forests were used in classifying data sets
with highly-imbalanced classes. Often the interest is in
ensuring correct classification of the "rare" class. This
variant of the Random Forest classifier assigns a weight to
each class, with the rare class given the larger weight.
Weighting occurs twice: once when choosing where to split
and again in the terminal node. The classification for each
terminal node is decided by a "weighted majority vote." The
number of cases of each class in the node is multiplied by
the weight given to that class, and the node is assigned
the class with the higher weighted count. Random Forests
proved to be more robust than CART 4.5, Neural Nets or
SHRINK in classifying highly-imbalanced data sets. [Chen,
et al. 2004] Therefore, the Random Forest algorithm as
implemented by Breiman and Cutler in the Salford Systems
Random Forest v1.0 package will be used to classify
anomalies in our experiment.
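
The weighted majority vote in a terminal node can be sketched
as follows. This is an illustration of the rule as described
above, not the Salford Systems implementation; the function
name and the example weights are hypothetical.

```python
from collections import Counter

def weighted_node_class(node_labels, class_weights):
    """Classify a terminal node by weighted case counts."""
    counts = Counter(node_labels)
    # Number of cases of each class times the weight of that class.
    scores = {c: n * class_weights[c] for c, n in counts.items()}
    return max(scores, key=scores.get)

node = ["normal"] * 9 + ["attack"]          # the rare class is outnumbered
print(weighted_node_class(node, {"normal": 1, "attack": 1}))    # plain majority
print(weighted_node_class(node, {"normal": 1, "attack": 20}))   # rare class upweighted
```

With equal weights the single attack case is outvoted; with a
larger weight on the rare class the same node is labeled as an
attack, which is the effect the weighting is meant to have.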

F. RELATED WORK 

This section discusses related work in the problem area
of anomaly-detection and classification for intrusion
detection.

The majority of recent approaches to classify 
anomalies from network traffic information have focused on 
the changes in volume of network traffic as a key 
metric. [Duan, et al. 2005, Hong Han, et al. 2002, 
Jaroszewicz, et al. 2005, Jian Yin, et al. 2004, Julisch, 
et al. 2002, Khanna, et al. 2006] However, as seen in 
Table 2, anomalies also impact the traffic-feature 
distributions in differing ways. 

Anomaly Label | Definition | Traffic Feature Distributions Affected
Alpha Flows | Unusually large volume point to point flow | Source and destination address (possibly ports)
DOS | Denial of Service Attack (distributed or single-source) | Destination address, source address
Flash Crowd | Unusual burst of traffic to single destination, from a "typical" distribution of sources | Destination address, destination port
Port Scan | Probes to many destination ports on a small set of destination addresses | Destination address, destination port
Network Scan | Probes to many destination addresses on a small set of destination ports | Destination address, destination port
Outage Events | Traffic shifts due to equipment failures or maintenance | Mainly source and destination address
Point to Multipoint | Traffic from single source to many destinations, e.g., content distribution | Source address, destination address
Worms | Scanning by worms for vulnerable hosts (special case of Network Scan) | Destination address and port

Table 2   Qualitative effects on traffic feature
distributions by differing anomaly types. (From Ref.
[Lakhina, et al. 2005])

There are several types of anomalies that have very
little impact on the volume of network traffic and thus can
escape a volume-based approach to anomaly-detection.
Therefore, a different approach must be undertaken to
locate low-volume anomalies in network traffic. [Lakhina,
et al. 2005] Lakhina's hypothesis was that anomalies induce
a change in the origin-destination (OD) flow. For example,
a worm will skew the distribution of destination addresses
and the distribution of the target port the worm is
scanning.

Several machine-learning approaches have been utilized
in classifying anomalies in network traffic, but Random
Forests have not been thoroughly studied for their
effectiveness in classifying anomalies. [Yang, et al.
2004, Zhao Junzhong, et al. 2002, Ren, et al. 2004, Chavan,
et al. 2004, Colombe, et al. 2004] Random Forests have
been very successful in other domains in classifying
highly-imbalanced data. [Chen, et al. 2004, Ham, et al.
2005, Jian Xue, et al. 2006]

This paper will use the definition from [Lakhina, et
al. 2004] to describe traffic features. A traffic feature
is a field in the header of a packet. Four fields from the
header will be used to identify anomalies: source address
(denoted sIP), destination address (denoted dIP), source
port (denoted sPort), and destination port (denoted dPort).

A method to measure the uncertainty of a given
discrete event occurring, based on a set of observed
distributions, was first described in 1949. [Shannon, et
al. 1949] This metric is known as sample entropy. Starting
with a discrete set of symbols $\{s_1, s_2, \ldots, s_n\}$
with associated probabilities $P_i$, the entropy of the
discrete distribution, a measure of the randomness in a
sequence of symbols drawn from it, is shown in equation
(1.2).

$$\text{Sample Entropy} = H = -\sum_{i=1}^{n} P_i \log_2 P_i \tag{1.2}$$

The value of sample entropy lies in the range
$[0, \log_2 n]$. Note that entropy does not depend on the
symbols themselves, just on their probabilities. For a
given number of symbols $n$, the uniform distribution, in
which every symbol is equally likely to appear, is the
maximum entropy distribution, with $H = \log_2 n$. The
minimum entropy distribution occurs when the distribution
is totally concentrated on one symbol; here the metric
takes on a value of $H = 0$. [Duda. 2001]
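
Equation (1.2) can be computed from observed values as in the
sketch below, which estimates each probability by the relative
frequency of a value in the sample. The function name and the
example port lists are hypothetical.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Entropy, in bits, of the empirical distribution of `values`."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Destination ports seen in two hypothetical time slices.
concentrated = [80, 80, 80, 80]      # all traffic to one port
dispersed = [21, 22, 23, 80]         # a scan touching four ports equally

print(sample_entropy(concentrated))  # minimum: H = 0
print(sample_entropy(dispersed))     # maximum for n = 4: H = log2(4) = 2
```

The two slices contain the same number of packets, so a volume
metric would not separate them, while the entropy of the
destination port does.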

Sample entropy can be used to estimate the actual
entropy of the random behavior of the 1999 DARPA data set.
This paper does not claim to capture the actual random
behavior of all five weeks of the 1999 DARPA data traffic.
Rather, we use sample entropy as a metric to capture the
frequency tendency of the distribution of only the observed
data set.

In this thesis, sample entropy is computed from 
feature distributions gathered from probe counts. Sample 
entropy's range of values depends on the number of distinct 
values seen in the observed data set. 

We also calculate another metric, which we call
Normalized Information, from equations (1.3) and (1.4).
Information is calculated by finding the data set frequency
distribution for a feature. Let $P_i$ be the probability of
a feature occurring in the overall data set. The value of
information in a five-minute time slice is normalized by
the average number of bytes per packet in the given time
slice, as seen in equation (1.4).

$$\text{Information} = -\sum_{i=1}^{n} \log_2 P_i \tag{1.3}$$

$$\text{Normalized Information} = \frac{\text{Information}}{\text{Avg Num Bytes Per Packet}} \tag{1.4}$$
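
Equations (1.3) and (1.4) can be sketched as follows. The text
does not state what the sum ranges over, so this sketch assumes
one term per feature value observed in the time slice, with
each probability taken from the frequency of that value in the
overall data set. The function name, argument names and example
values are hypothetical.

```python
import math
from collections import Counter

def normalized_information(slice_values, dataset_values, avg_bytes_per_packet):
    """Information of a time slice divided by its average bytes per packet.

    Assumes every value in the slice also occurs in the overall data set.
    """
    counts = Counter(dataset_values)
    total = len(dataset_values)
    # Equation (1.3): sum of -log2(P_i) over the values seen in the slice.
    information = -sum(math.log2(counts[v] / total) for v in slice_values)
    # Equation (1.4): normalize by the slice's average packet size.
    return information / avg_bytes_per_packet

# Hypothetical destination ports for the whole data set and one slice.
dataset = [80, 80, 80, 80, 22, 22, 23, 25]
time_slice = [80, 22, 23]            # P = 1/2, 1/4, 1/8 -> 1 + 2 + 3 bits
print(normalized_information(time_slice, dataset, avg_bytes_per_packet=2.0))
```

Values that are rare in the overall data set contribute more
bits, so a slice full of unusual addresses or ports scores
higher than a slice of common ones with the same packet size.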


G. CHAPTER SUMMARY 

In summary, this chapter described, at a high level,
differing ways to intrude into a computer network and
systems designed to detect that behavior. In addition,
three types of malicious activity were described. Four
machine-learning algorithms were introduced, with the Random
Forest algorithm covered in greater detail. The reasons
for choosing the Random Forest algorithm as our classifier
were also discussed. Finally, sample entropy and normalized
information were defined.

III. DESIGN OF EXPERIMENT


In the last chapter we described several malicious
attacks and various machine-learning algorithms. In this
chapter, we describe how we generate a matrix of
vectors for each data set. We then describe how we used
this labeled data to run a series of Random Forest
experiments with the goal of predicting classification
labels from the predictor vectors.

A. EXPERIMENTAL OVERVIEW 

Machine-learning algorithms have been utilized in 
anomaly-detection experiments. However, prior to May of 
2006, there had not been any published results using Random 
Forests. Since the ratio of attack traffic to normal 
traffic is highly-imbalanced, we selected the Random Forest 
algorithm. 

The remainder of this chapter details the data set, the
code used to transform the data set, the software packages
used in conducting the experiments, and the problems
encountered in the conduct of the experiments.

B. DATA SET 

To run our experiments we used traffic generated by
the MIT Lincoln Labs as part of the DARPA 1999 IDS
evaluation. [Zissman. 2002] The evaluation has five weeks
of traffic, divided into two sections. The first portion
of the evaluation is three weeks of training data. Only
the second week of this data contained attacks. The second
portion of the evaluation is two weeks of test data. Each
week of data had five days of traffic. Traffic was
collected at two points in the network, inside and outside
the boundary router. Data collection began at approximately
8am each day and stopped at 6am the following day, which
means each block of traffic contained approximately 22
hours of traffic. The simulation was then shut down for
maintenance and restarted at 8am for the next day's data
block.

Collected Data 

We took the inside and outside tcpdump data from the
second week of training data and the first week of test
data for our experimental data. There was an error on the
second day of the test week when the network traffic sniffer
did not collect inside traffic. The first day of the second
week of test data was used to ensure that there were five
days of traffic from both the training and test data for
evaluation.

C. TRANSFORMING THE DATA 

The DARPA data contained full Ethernet packets. To
run the experiments, we needed to extract six features from
each packet: timestamp, source IP address, destination IP
address, source port number, destination port number, and
number of bytes in the packet. All code used to transform
the data can be found in Appendix A.

1. Sampling 

To simulate sampling from live traffic, only every
tenth packet was chosen for the experimental data set.
Sampling was conducted separately on the inside and outside
tcpdump files. This sampling was done with Sampler.java.

2. Extracting Features from Full Packet Data 

We explored various means for extracting the six
features from a packet. Network traffic viewers like
Wireshark and Tcpdump were tested for their ability to
extract the features and were found to be excellent at
filtering on a match of a particular feature. These
products could not, however, extract just the information
from an identified feature in the packet; therefore, another
tool was required. Additional research located SiLK: The
Analyst's Suite. [SourceForge. 2006] This suite was
created by Carnegie Mellon University's Computer Emergency
Response Team (CERT) to examine traffic throughout a
network, observe malicious activity, trace server behavior,
and perform other analytical tasks. [SourceForge. 2006] The
SiLK software package currently works only on Linux-type
operating systems. We utilized three analytical tools to
extract the features from the tcpdump data files.

a. Converting TCPDUMP to SiLK files

We first had to convert the data from the Tcpdump
file format into SiLK flow records. SiLK flow records
collapse fragmented packets into one flow record. This
allows for the addition of OSI layer four information to the
flows. We used the rwptoflow utility to make the
conversion. Figure 1 shows the command line used to
transform a file into a SiLK flow record.


C:\DARPA data\test data>rwptoflow inside8Mar.tcpdump > inside8Mar.rw


Figure 1. Rwptoflow file conversion

b. Extracting Features from a SiLK File 

Once we had the data in a SiLK file, we were able
to employ the rwcut utility to print selected fields to a
new file. We used the -delimited option to separate the
output fields with commas and the -fields option to select
the fields to be sent to the output file. Figure 2 provides
an example of the command.


C:\>rwcut inside8Mar.rw -delimited=, -fields=sIP,dIP,sPort,dPort,bytes,eTime > inside8Mar.txt


Figure 2. Rwcut command to extract selected fields to a
designated output file

The data was now in a flat file, with each field
separated by commas and only one packet per line. The
last field also has a comma after it to allow for ease of
reordering fields if necessary. Figure 3 shows one line of
output ordered per the rwcut utility: source IP address,
destination IP address, source port, destination port,
number of bytes in the packet, and the start time of the
packet. Time is formatted with 24-hour numbering and the
time zone is Greenwich Mean Time.


196.37.75.158,172.16.112.194,25,1024,15872,1999/03/29T13:00:04.106,

Figure 3. Extracted data fields
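
To make the record layout in Figure 3 concrete, the following is a minimal Java sketch of parsing one such line. It is not the thesis code (the thesis used Perl scripts for this stage); the class name `PacketRecord` and its field names are hypothetical, and the sketch assumes that the date and time are separated by the letter T and that the six fields appear in the order shown in Figure 3.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical holder for one line of rwcut output; illustrative only. */
final class PacketRecord {
    // Assumed timestamp layout, inferred from the sample line in Figure 3.
    private static final DateTimeFormatter TIME_FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd'T'HH:mm:ss.SSS");

    final String sourceIp;
    final String destinationIp;
    final int sourcePort;
    final int destinationPort;
    final long bytes;
    final LocalDateTime time;

    private PacketRecord(String sourceIp, String destinationIp, int sourcePort,
                         int destinationPort, long bytes, LocalDateTime time) {
        this.sourceIp = sourceIp;
        this.destinationIp = destinationIp;
        this.sourcePort = sourcePort;
        this.destinationPort = destinationPort;
        this.bytes = bytes;
        this.time = time;
    }

    /** Parses one comma-delimited line; a trailing comma is tolerated. */
    static PacketRecord parse(String line) {
        // String.split drops trailing empty strings, so the final comma
        // in the rwcut output does not produce a seventh field.
        String[] f = line.trim().split(",");
        if (f.length < 6) {
            throw new IllegalArgumentException("expected 6 fields: " + line);
        }
        return new PacketRecord(
                f[0], f[1],
                Integer.parseInt(f[2]), Integer.parseInt(f[3]),
                Long.parseLong(f[4]),
                LocalDateTime.parse(f[5], TIME_FORMAT));
    }

    public static void main(String[] args) {
        PacketRecord r = parse(
                "196.37.75.158,172.16.112.194,25,1024,15872,1999/03/29T13:00:04.106,");
        System.out.println(r.sourceIp + " -> " + r.destinationIp + " at " + r.time);
    }
}
```

Keeping the timestamp as a parsed value rather than a string makes the later sorting and five-minute bucketing steps simple comparisons.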

3. Calculating Aggregate Data for a Given Five-minute Time Slice

a. Sorting the Data by Time

We now had the data in separate flat files and
needed to combine them into one large file ordered by
timestamp. We used the DOS copy command to append the
files into one large file. We created two small Perl
programs called Reorder.pl and Sorter.pl. Reorder.pl
reorders the fields with the timestamp first, which allows
Sorter.pl to sort the data on that field and output the
data back into the same file in ascending order.

b. Placing Packets in Five-minute Time Slices

We utilized a Perl package called Date::Calc,
which allows for comparison of two dates. The package
contains a Delta function to determine the difference
between two times. We decided to split our data into
five-minute time slices to allow for comparison to related
work. We created an array of arrays, in which each inner
array contained five minutes' worth of packets. We used
Entropy_Info.pl to do this comparison. Figure 4 illustrates
the average number of packets calculated per time slice.
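
The slicing step can be sketched in Java as follows. This is an illustrative reconstruction rather than the logic of Entropy_Info.pl: the class `TimeSlicer` is hypothetical, and the sketch assumes that slices are measured from the first timestamp in the sorted input and that empty slices are kept so that slice indices stay aligned with wall-clock time.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that groups sorted timestamps into five-minute slices. */
final class TimeSlicer {
    static final Duration SLICE = Duration.ofMinutes(5);

    /** Index of the five-minute slice containing t, counting from start. */
    static int sliceIndex(LocalDateTime start, LocalDateTime t) {
        long seconds = Duration.between(start, t).getSeconds();
        if (seconds < 0) {
            throw new IllegalArgumentException("timestamp precedes start");
        }
        return (int) (seconds / SLICE.getSeconds());
    }

    /** Returns a list of lists, one inner list per consecutive slice. */
    static List<List<LocalDateTime>> slice(List<LocalDateTime> sorted) {
        List<List<LocalDateTime>> slices = new ArrayList<>();
        if (sorted.isEmpty()) {
            return slices;
        }
        LocalDateTime start = sorted.get(0);
        for (LocalDateTime t : sorted) {
            int idx = sliceIndex(start, t);
            // Pad with empty slices when there is a gap in the traffic.
            while (slices.size() <= idx) {
                slices.add(new ArrayList<>());
            }
            slices.get(idx).add(t);
        }
        return slices;
    }

    public static void main(String[] args) {
        LocalDateTime s = LocalDateTime.of(1999, 3, 29, 13, 0, 4);
        List<LocalDateTime> ts = new ArrayList<>();
        ts.add(s);
        ts.add(s.plusMinutes(5));
        System.out.println(slice(ts).size() + " slices");
    }
}
```

In practice the inner lists would hold full packet records rather than bare timestamps; timestamps alone are used here to keep the sketch short.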


[Figure: Week 2 Average Number of Packets per 5 Minute Time Slice; horizontal axis: 5 Minute Time Slices]

Figure 4. Average Number of Packets per Time Slice

c. Overall Probability Distribution

We created a hash table for each of four features:
source IP address, destination IP address, source port, and
destination port. We also counted the total number of
packets in the file. For each unique key in a hash table we
can then determine the number of counts for that key and
the overall probability of the key in the distribution. Each
key is associated with its overall probability, and the
pairs are written to a new hash table. We used
Entropy_Info.pl to do this calculation.

d. Five-minute Time Slice Information

We examine each five-minute slice of traffic for
each of the four features. For each unique instance of a
feature, we find the log_2 of that instance's probability
from the overall distribution calculated earlier, and we sum
these values, negated as in equation (1.3), to obtain the
overall information contained in the five-minute time slice.
We used Entropy_Info.pl to do this calculation.

e. Normalizing Information 

Since the number of packets in each time slice
varied greatly, we needed to normalize the information by
dividing each raw feature value by the number of packets in
the time slice. This calculation was done using Excel.
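
Steps c through e can be sketched together in Java. This is a hedged illustration, not a port of Entropy_Info.pl or of the Excel step: the class `SliceInformation` and its method names are hypothetical, feature values are treated as plain strings, and every value seen in a slice is assumed to appear in the overall distribution (which holds when the slices are drawn from the same file).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the overall distribution and per-slice information. */
final class SliceInformation {

    /** Step c: relative frequency of each feature value over the whole file. */
    static Map<String, Double> overallProbabilities(List<String> allValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : allValues) {
            counts.merge(v, 1, Integer::sum);
        }
        Map<String, Double> probabilities = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probabilities.put(e.getKey(), e.getValue() / (double) allValues.size());
        }
        return probabilities;
    }

    /** Step d, equation (1.3): -sum of log2 p over the distinct values in a slice. */
    static double information(List<String> sliceValues, Map<String, Double> overall) {
        Set<String> distinct = new HashSet<>(sliceValues);
        double sum = 0.0;
        for (String v : distinct) {
            sum -= log2(overall.get(v));
        }
        return sum;
    }

    /** Step e as described in the text: divide by the packet count of the slice. */
    static double perPacket(double information, int packetsInSlice) {
        return information / packetsInSlice;
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        List<String> all = List.of("80", "80", "25", "53");
        Map<String, Double> overall = overallProbabilities(all);
        List<String> slice = List.of("80", "25", "80");
        double info = information(slice, overall);
        System.out.println(info + " " + perPacket(info, slice.size()));
    }
}
```

Note that the sum runs over distinct values in the slice, so a value that appears many times in one slice contributes only once; repeated traffic to a common port does not inflate the score.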

f. Five-minute Time Slice Entropy

We examine each five-minute slice of traffic for
each of the four features. For each unique instance of a
feature, we calculate the instance's probability from the
five-minute probability distribution. This probability is
multiplied by the log_2 of the probability, and the products
are summed and negated, as in equation (1.2), to give the
entropy of the five-minute time slice. We used
Entropy_Info.pl to perform the calculations.

g. Average Number of Bytes per Packet in a Five-minute Time Slice

We calculated the average number of bytes per packet in
the time slice by determining the total number of bytes in
the time slice and dividing by the total number of packets
in the time slice. We used Bytes.pl to extract the total
number of bytes per time slice and imported that data into
the Excel spreadsheet containing the other data.
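
The per-slice entropy and the average bytes per packet can be sketched in Java as shown below. As with the other sketches, this is an assumption-laden illustration rather than the thesis's Perl code: the class `SliceEntropy` is hypothetical, and feature values are represented as strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of steps f and g for a single five-minute slice. */
final class SliceEntropy {

    /** Equation (1.2), with probabilities estimated from the slice itself. */
    static double sampleEntropy(List<String> sliceValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : sliceValues) {
            counts.merge(v, 1, Integer::sum);
        }
        double n = sliceValues.size();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2.0));
        }
        return h;
    }

    /** Total bytes in the slice divided by the number of packets in the slice. */
    static double averageBytesPerPacket(long[] packetBytes) {
        if (packetBytes.length == 0) {
            throw new IllegalArgumentException("empty slice");
        }
        long total = 0;
        for (long b : packetBytes) {
            total += b;
        }
        return (double) total / packetBytes.length;
    }

    public static void main(String[] args) {
        System.out.println(sampleEntropy(List.of("80", "80", "25", "53")));
        System.out.println(averageBytesPerPacket(new long[] {60, 1500, 40}));
    }
}
```

The key difference from the information metric is the source of the probabilities: entropy uses the distribution inside the slice, while information uses the distribution over the whole data set.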
D. RANDOM FORESTS 

This section describes the basic setup of Salford
Systems' implementation of Random Forests (RF). A trial
version of this software package is available for 30 days.

1. Variable Selection

We loaded our data in CSV format. Figure 5 shows
the initial menu after the data is loaded. We select
the four predictor variables and a target variable. The
target variable is what the RF attempts to classify
based on the predictor variables.

Figure 5. Random Forest Variable Selection

2. Testing 

One of the major advantages of RF is that it does not
modify the original data set. RF by default uses
Out-of-Bag data for testing. For each tree, approximately
one-third of the data is left out of the bootstrap sample
and used for self-testing. This is an extension of
cross-validation which is repeated several hundred times,
once per tree, and it gives a reliable estimate of
performance. Figure 6 shows how the weights of each class
can be modified. Figure 7 shows how the testing process can
be modified if desired.

If classifying one class is important, the weighting for
that class can be specified orders of magnitude higher than
for the other classes.

Figure 6. Weighting Choices for training and testing

Figure 7. Options for Testing the Forest


3. Random Forest Tree Choices

The next screen allows the tester to choose the number
of trees to be grown, the number of predictors for a node,
and the size of the bootstrap sample, as shown in Figure 8.
The manual recommends that the number of predictors for
each node be the square root of the total number of
predictors.



Figure 8. Options for Testing the Forest 


3. Experiment Parameters 

Several experiments were run on four data sets, each a
different combination of the metrics: Normalized
Information; Sample Entropy; Normalized Information and
Sample Entropy; and Normalized Information with Sample
Entropy and the average number of bytes per packet, all of
which we defined in Chapter II.

a. Variables

All available predictors in each data set were
utilized. Classification was always the target variable.

b. Model Adjustments 

Each of the following parameters in the model
was adjusted independently. The number of trees to be
built was set to 100, 250, 500, and 1000. For each
size of the forest, class weights were varied between
balanced classes and weighting attack traffic at 10, 100,
and 1000. The weight of normal traffic was kept constant
at 1.

E. PROBLEMS DISCOVERED IN CONDUCT OF THE EXPERIMENTS 

There were a few unexpected problems that occurred 
while we conducted these experiments that should be noted. 

1. DARPA 1999 Training Data 

The week two training data, which contains the attack
sequences, lists only the starting times for the attacks.
In the test data the duration of each attack was also
recorded. This proved to be significant, as several attacks
spanned more than 10 minutes. This meant that one of
these attacks must result in multiple five-minute time
slices being treated as containing attack traffic, which
cannot be done without knowing the duration. Therefore the
second week of training data was not utilized in obtaining
our experimental results.

2. Stealth Watch 

Data was initially collected from the Naval
Postgraduate School's network. Stealth Watch stores probes
for 30 days, which would allow for a robust data set. A
careful examination of the probe data set showed that only
highly suspicious probes were present in the data set.
Using this data would not provide the correct balance of
normal to attack packets.

3. Wireshark

A second attempt was made to collect raw packets from
the Naval Postgraduate School's network utilizing
Wireshark, the packet sniffer formerly known as Ethereal.
In this attempt, seven days of traffic would be collected
and attack traffic would be identified after the fact
utilizing Snort and Stealth Watch. One second of network
traffic sampled at 2:30 pm on a weekday generated a file two
megabytes in size. There was sufficient space on the campus
storage area network to store the files until they could be
reduced using the SiLK suite. However, Wireshark was
generating temporary files on the collection server and
within minutes would consume all free disk space available
and crash the service. Limiting packet captures to 68 bytes
increased the time to service failure, but not enough to
make it a viable approach for large amounts of traffic.
Using multiple files was also attempted without success.
The first file would write correctly, and then the service
would crash when attempting to write to the second file.

F. CHAPTER SUMMARY 

In summary, this chapter described the experiment's
data set, how the data set was transformed, and the
parameters for the experiment runs, and ended with a
discussion of problems encountered in conducting the
experiment.

IV. DATA RESULTS AND EVALUATION


In the last chapter, we described the experiment 
design. In this chapter, we will describe the results of 
the series of Random Forest experiments. 

A. BASELINE: CALL-ALL-BAD 

In all the experiments presented in this chapter, the
baseline used is Call-all-bad. We use the f-score, equation
(1.5), rather than the general harmonic mean, equation
(1.6), to evaluate our results. The baseline assumes an
algorithm that labels all observations as bad. Precision is
then simply the proportion of actual bads in the data set,
and recall is always 1.

    \text{F-Score} = \frac{2}{\frac{1}{P} + \frac{1}{R}}    (1.5)

    \text{Harmonic Mean} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}    (1.6)


The baseline results are conservative since the recall
score is 1, which increases the baseline f-score. The
key point is that recall should not come at the expense of
precision. If it does, then the f-score, being a special
case of the harmonic mean, is lowered. To show this,
first compute the f-score with p=.6 and r=.6. The answer
is .6, identical to the arithmetic mean. However, if you
adjust your algorithm such that recall is increased to 1
while precision is lowered to .1, the harmonic mean is
lower than the arithmetic mean. That is, the arithmetic
mean would be .55, while the f-score would be .182.
[Martell. 2005]

1. 1999 DARPA Week Two Test Baseline

There are 1099 good observations and 202 bad
observations in the data set. This generates a precision
of .155 and a recall of 1, resulting in an f-score baseline
of .27.
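
The baseline arithmetic can be checked with a few lines of Java. This is a minimal sketch; the class `FScore` and its method names are hypothetical. The `percentOfBaseline` helper reflects our reading that the "Increase over Baseline" column in the tables that follow is the ratio of a run's f-score to the baseline f-score expressed as a percentage, which is an inference from the reported numbers rather than something the text states.

```java
/** Hypothetical helper for the f-score and Call-all-bad baseline arithmetic. */
final class FScore {

    /** Equation (1.5): the harmonic mean of precision and recall. */
    static double fScore(double precision, double recall) {
        if (precision <= 0.0 || recall <= 0.0) {
            return 0.0;
        }
        return 2.0 / (1.0 / precision + 1.0 / recall);
    }

    /** Precision when every observation is labeled bad. */
    static double callAllBadPrecision(int good, int bad) {
        return (double) bad / (good + bad);
    }

    /** A run's f-score as a percentage of the baseline f-score (assumed reading). */
    static double percentOfBaseline(double f, double baselineF) {
        return 100.0 * f / baselineF;
    }

    public static void main(String[] args) {
        double precision = callAllBadPrecision(1099, 202);
        double baseline = fScore(precision, 1.0);
        System.out.println("precision = " + precision);
        System.out.println("baseline f-score = " + baseline);
        System.out.println("p=.6, r=.6 -> " + fScore(0.6, 0.6));
        System.out.println("p=.1, r=1  -> " + fScore(0.1, 1.0));
    }
}
```

With 1099 good and 202 bad observations, the precision is 202/1301 and the baseline f-score is 404/1503, which round to the .155 and .27 reported above.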

A. NORMALIZED INFORMATION EXPERIMENTS 

In this section we present the results of our 
experiments using normalized information. This section is 
divided into balanced and unbalanced experiments. The 
balanced experiments attempt to maximize precision and 
recall for both good and bad, while unbalanced experiments 
try to maximize recall at the expense of precision. The 
unbalanced experiments were done because for defense 
purposes, we are far more concerned that all bad 
observations are captured. A result of this weighting is 
that some good observations will be included in the bad 
classification observations. 

1. Balanced Experiments 

The results of the balanced experiments are given in
Table 3. All the experiments are versions of Random
Forests with the differences being in the number of trees
used.

Normalized Information, Balanced

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.68        0.70     0.69      257%
250 Trees      0.66        0.74     0.70      260%
500 Trees      0.67        0.71     0.69      257%
1000 Trees     0.66        0.72     0.69      256%

Table 3. Precision, Recall, and F-Score Results for
Balanced Weighting on the Normalized Information
Data Set

2. Unbalanced Experiments 

The results of the unbalanced experiments are given in 
Tables 4-6. All the experiments are versions of Random 
Forests with the differences in the number of trees used. 



Normalized Information, Bad Weight 10

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.58        0.80     0.67      251%
250 Trees      0.58        0.82     0.68      252%
500 Trees      0.58        0.82     0.68      252%
1000 Trees     0.58        0.83     0.68      253%

Table 4. Precision, Recall, and F-Score Results for Bad
Weighted 10 on the Normalized Information Data Set


Normalized Information, Bad Weight 100

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.42        0.94     0.58      216%
250 Trees      0.38        0.96     0.54      201%
500 Trees      0.38        0.96     0.55      203%
1000 Trees     0.38        0.96     0.55      204%

Table 5. Precision, Recall, and F-Score Results for Bad
Weighted 100 on the Normalized Information Data Set



Normalized Information, Bad Weight 1000

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.41        0.96     0.57      213%
250 Trees      0.36        0.97     0.52      195%
500 Trees      0.34        0.99     0.51      189%
1000 Trees     0.32        0.99     0.49      181%

Table 6. Precision, Recall, and F-Score Results for Bad
Weighted 1000 on the Normalized Information Data Set

3. Results 

It is interesting that the 1000-tree unbalanced run
with bad weighted at 10 was able to increase the recall by
.07 while only reducing precision by .08, as compared to
the 250-tree balanced run, resulting in an f-score decrease
of only .02. It was also interesting that there was a clear
recall increase from continually weighting bad more
heavily, so that in the 500-tree run with a bad weight of
1000, a recall of .99 was achieved, at a cost of precision
dropping to .34, for an f-score of .51. We also note that
growing the Random Forest to 1000 trees could hurt the
precision in certain runs of the experiment.

B. SAMPLE ENTROPY 

In this section we present the results of our
experiments using sample entropy. As before, this section
is divided into balanced and unbalanced experiments.

1. Balanced Experiments

The results of the balanced experiments are given in
Table 7. All the experiments are versions of Random
Forests with the differences being in the number of trees
used.



Sample Entropy, Balanced

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.63        0.65     0.64      239%
250 Trees      0.65        0.67     0.66      246%
500 Trees      0.65        0.67     0.66      247%
1000 Trees     0.65        0.67     0.66      246%

Table 7. Precision, Recall, and F-Score Results for
Balanced Weighting on the Sample Entropy Data Set

2. Unbalanced Experiments

The results of the unbalanced experiments are given in
Tables 8-10. All the experiments are versions of Random
Forests with the differences being in the number of trees
used.



Sample Entropy, Bad Weight 10

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.54        0.78     0.64      239%
250 Trees      0.54        0.80     0.64      239%
500 Trees      0.53        0.82     0.64      239%
1000 Trees     0.53        0.81     0.64      238%

Table 8. Precision, Recall, and F-Score Results for Bad
Weight 10 on the Sample Entropy Data Set



Sample Entropy, Bad Weight 100

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.37        0.93     0.53      195%
250 Trees      0.34        0.94     0.50      186%
500 Trees      0.35        0.96     0.52      192%
1000 Trees     0.35        0.96     0.51      190%

Table 9. Precision, Recall, and F-Score Results for Bad
Weight 100 on the Sample Entropy Data Set


Sample Entropy, Bad Weight 1000

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.38        0.94     0.54      202%
250 Trees      0.34        0.97     0.50      187%
500 Trees      0.32        0.98     0.48      179%
1000 Trees     0.32        0.98     0.49      181%

Table 10. Precision, Recall, and F-Score Results for
Bad Weight 1000 on the Sample Entropy Data Set

3. Results 

These results were interesting in that growing forests
from a size of 100 to 1000 increased the recall by only
.03 to .05 for a given weight factor. In this series of
runs, the best recall achieved was .98, by weighting bad at
1000; this resulted in a precision of .32 for an f-score of
.49.

C. NORMALIZED INFORMATION AND SAMPLE ENTROPY 

In this section we present the results of our 
experiments using normalized information and sample 
entropy. As before, this section is divided into balanced 
and unbalanced experiments. 

1. Balanced Experiments 

The results of the balanced experiments are given in
Table 11. All the experiments are versions of Random
Forests with the differences being in the number of trees
used.

Normalized Information + Sample Entropy, Balanced

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.64        0.73     0.68      255%
250 Trees      0.64        0.77     0.70      261%
500 Trees      0.64        0.76     0.69      259%
1000 Trees     0.64        0.78     0.70      260%

Table 11. Precision, Recall, and F-Score Results for
Balanced Weighting on the Normalized Information and
Sample Entropy Data Set


2. Unbalanced Experiments 

The results of the unbalanced experiments are given in 
Tables 12-14. All the experiments are versions of Random 
Forests with the differences being in the number of trees 
used. 



Normalized Information + Sample Entropy, Bad Weight 10

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.56        0.89     0.68      254%
250 Trees      0.53        0.89     0.67      248%
500 Trees      0.54        0.91     0.67      251%
1000 Trees     0.53        0.91     0.67      249%

Table 12. Precision, Recall, and F-Score Results for
Bad Weight 10 on the Normalized Information and Sample
Entropy Data Set


Normalized Information + Sample Entropy, Bad Weight 100

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.42        0.94     0.58      216%
250 Trees      0.38        0.98     0.55      205%
500 Trees      0.39        0.98     0.56      207%
1000 Trees     0.39        0.97     0.55      205%

Table 13. Precision, Recall, and F-Score Results for
Bad Weight 100 on the Normalized Information and Sample
Entropy Data Set



Normalized Information + Sample Entropy, Bad Weight 1000

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
100 Trees      0.42        0.95     0.59      218%
250 Trees      0.37        0.97     0.54      199%
500 Trees      0.33        0.98     0.50      185%
1000 Trees     0.30        0.99     0.47      173%

Table 14. Precision, Recall, and F-Score Results for
Bad Weight 1000 on the Normalized Information and
Sample Entropy Data Set

3. Results


It was interesting that for the balanced weighting
there was no meaningful gain in f-score from growing the
forest larger than 250 trees. It was also interesting that
with a bad weight of 100, a recall of .98 and a precision
of .39, resulting in an f-score of .56, were achievable at
500 trees. Recall saw a gain of .04 at 500 trees while the
f-score was reduced by only .02 compared to the 100-tree
f-score.

D. NORMALIZED INFORMATION, SAMPLE ENTROPY AND AVERAGE BYTES PER PACKET

In this section we present the results of our 
experiments as seen in Table 15 using normalized 
information, sample entropy and average bytes per packet. 

1. 500 Tree Experiments 



Normalized Information + Sample Entropy + Avg Bytes, 500 Trees

Run            Precision   Recall   F-Score   Increase over Baseline
Call-all-Bad   0.16        1.00     0.27
Balanced       0.62        0.79     0.69      257%
Bad Wgt 10     0.52        0.92     0.66      247%
Bad Wgt 100    0.38        0.97     0.55      204%
Bad Wgt 1000   0.33        0.99     0.50      185%

Table 15. Precision, Recall, and F-Score Results for
500 Trees with all Weightings on the Normalized
Information, Sample Entropy and Average Bytes per
Packet Data Set

2. Results

Since the previous three sets of experiments showed
very minor gains from going beyond 500 trees, we decided to
run this series of experiments at 500 trees. This run of
experiments was not expected to show better results than
the earlier data sets and was run only because the data was
available. It was very interesting that it was possible to
achieve a recall of .99 with a bad weighting of 1000;
precision, however, was only .33, resulting in an f-score
of .50.

E. EVALUATION OF OVERALL RESULTS 

Overall we found it very interesting that the worst
f-score result was .47, from the Normalized Information and
Sample Entropy run with 1000 trees and a bad weight of
1000. This result was still 173% of the baseline f-score.
More importantly, with this f-score, recall was .99, and
recall is the metric of focus for intrusion detection.

We also found it puzzling that the Normalized
Information metric independently could achieve a higher
recall than Sample Entropy. Further work is needed to
analyze this result.

An extremely good result was the ability to obtain a
recall of .96 with a Random Forest of only 100 trees. This
result came from the Normalized Information data set with a
bad weight of 1000. This result can be run on a laptop running
a 2GHz Pentium! processor with 384MB of RAM in under 30 
seconds. It shows the possibility of conducting near
real-time analysis of traffic and locating attack traffic
that is getting past a rule-based intrusion detection
system.

F. CHAPTER SUMMARY

In summary, this chapter detailed the results from
utilizing the four different combinations of variables
while varying the number of trees grown and the weight of
the bad data. In the next chapter, we will discuss our
conclusions and lay out possible future work.

V. CONCLUSION AND FUTURE WORK


A. CONCLUSION 

Random Forests were able to classify anomalies in the
1999 DARPA data set. Using only six features from the
TCP/IP header data, Random Forests could identify over 99%
of five-minute time slices containing attack traffic. This
recall comes at the cost of a significant reduction in the
overall f-score, dropping it from .65 to .34. However, this
is a worthwhile trade-off, since in intrusion detection the
goal is to ensure all attack traffic is captured.

We conclude that the distribution of packet features
(IP addresses and ports) reveals the presence of a wide
range of attack traffic. Sample entropy and normalized
information are capable of supporting automatic
classification of anomalies via supervised learning.

B. FUTURE WORK 

A goal of this thesis was to determine the
effectiveness of Random Forests in classifying anomalies in
network traffic. Therefore, future work should include
testing Random Forests on additional intrusion detection
data sets. Only the DARPA data sets from 1998 and
1999 are currently available for scientific researchers to
run experiments. It would be interesting to test on the
Abilene and Geant data sets to determine the effectiveness
of Random Forests on those data sets.^

A huge boon to the intrusion detection scientific
community would be to develop and make available a labeled
data set from the Naval Postgraduate School network.

^ Computer Science, Boston University, 111 Cummington Street, Boston,
MA 02215 | Telephone 617 353 8919 | E-mail cs@bu.edu

We also believe it would be interesting to transform
the 1999 DARPA data set using the subspace method and
multi-way method, combining all vectors into one, to allow
for a truer comparison to the results of [Lakhina, et al.
2005].

Another idea would be to compare the results of Random
Forests to the K-means algorithm. Unfortunately, in our
experiments the K-means results did not cluster the data
set sufficiently to allow for a scientific comparison.
The data sets should be run on other implementations of the
K-means algorithm to confirm these results.

Finally, we identify issues that could aid in the 
advancement of intrusion detection research: 

1. Develop a more efficient way of automatically 
consolidating, transforming, and analyzing extracted data. 
One possible approach would be to combine the various 
programs written for this thesis into one program that 
automatically generates files for Random Forest to 
classify. The Random Forest algorithm as implemented by 
Salford Systems is capable of running batch jobs. The 
automatically generated files could be run by a batch job 
and labeled as good or bad. This would allow a network 
analyst to focus on traffic labeled as bad. 

2. Explore the importance of the predictor 
variables, and determine whether the predictors are 
constant across the four data sets. Our experiments 
indicated that the source port was the key predictor of bad 
traffic and that the destination port was relatively 
unimportant. Evaluating these variables with principal 
component analysis could provide further insight into these 
findings. 

3. Develop a prototype classifier that computes sample 
entropy from near real-time traffic. This could be 
designed to work with a hard or sliding five-minute window. 
The time slices would then be automatically processed and 
run as a batch job by the Random Forest algorithm, which 
would label each time slice as good or bad. The results of 
the Random Forest would generate alerts for bad traffic. 

4. In parallel with the prototype classifier, run a 
standard rule-based IDS such as Snort. The Snort alerts 
could be correlated with alerts from the Random Forest 
classifier, and items with a low correlation would be 
flagged for human examination. A low correlation might 
indicate bad traffic that evaded Snort. 
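
The hard-window variant of the prototype described in item 3 
could be sketched in Java roughly as follows. This is an 
illustrative sketch, not code used in this thesis: the class 
name SliceEntropy, its method names, and the use of epoch 
seconds for packet timestamps are assumptions made for the 
example. It uses natural logarithms and computes the entropy 
of a single packet feature per five-minute slice. 

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: sample entropy of one packet feature per five-minute slice. */
class SliceEntropy {

    // Length of a hard (non-overlapping) window in seconds; five minutes.
    static final long SLICE_SECONDS = 300;

    /** Returns H = -sum(p * ln p) over the distinct values observed in one slice. */
    static double entropy(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / values.size();
            h -= p * Math.log(p);
        }
        return h;
    }

    /**
     * Assigns each record to the slice floor(timestamp / 300) and returns
     * the entropy of the feature for every slice, ordered by slice index.
     */
    static Map<Long, Double> entropyPerSlice(long[] epochSeconds, String[] feature) {
        Map<Long, List<String>> slices = new TreeMap<>();
        for (int i = 0; i < epochSeconds.length; i++) {
            long slice = epochSeconds[i] / SLICE_SECONDS;
            slices.computeIfAbsent(slice, k -> new ArrayList<>()).add(feature[i]);
        }
        Map<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, List<String>> e : slices.entrySet()) {
            result.put(e.getKey(), entropy(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up example data: four packets in the first slice, two in the second.
        long[] t = {0, 10, 20, 30, 310, 320};
        String[] srcPort = {"80", "80", "443", "443", "1024", "1024"};
        System.out.println(entropyPerSlice(t, srcPort));
    }
}
```

In the example data the first slice holds two source ports 
with equal frequency, so its entropy is ln 2, while the 
second slice holds a single port and has entropy zero. A 
full prototype would compute the same value for all four 
features and pass each slice to the classifier. 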

APPENDIX A. GENERATED CODE 


A. SAMPLER.JAVA 

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

/**
 * Written by Bret Hyla
 * Extracts 1 in 10 lines from a text file.
 * Sept 2006 NPS
 */
public class Sampler {

    public static void main(String[] args) {
        int count = 0; // number of lines written to the sample

        try {
            BufferedReader reader = new BufferedReader(
                    new FileReader("FullDataSet.txt"));
            try {
                BufferedWriter writer = new BufferedWriter(
                        new FileWriter("SampledDataSet.txt"));
                while (true) {
                    // Read ten lines and keep only the tenth.
                    String test = null;
                    for (int i = 0; i < 10; i++) {
                        test = reader.readLine();
                        if (test == null) {
                            break; // end of input
                        }
                    }
                    if (test == null) {
                        break;
                    }
                    writer.write(test);
                    writer.newLine();
                    count++;
                }
                writer.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
            reader.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

B. SORTER.PL 

# Usage on the command line: sorter.pl followed by filename
# file is sorted on first element, if tie then second etc
# base file came from
# http://www.developertutorials.com/tutorials/cgi-perl/perl-sorting-050423/page1.html
# Bret Hyla
# NPS

open(SORT, ">bytes_sorted.txt");

# read every input line, then write the lines in sorted order
my @words = <>;
foreach (sort @words) {
    print SORT $_;
}

close SORT;

C. REORDER.PL 

# The print SOURCESORTED line can contain any
# of the variables listed in the while loop.
# The order can be modified simply by choosing
# the variable to be put first.
# Bret Hyla September 2006

open(FILE, "allsampleddata.txt");
open(SOURCESORTED, ">bytes.txt");

while (<FILE>) {
    $line = $_;
    # fields are comma separated; $comma absorbs the trailing field
    ($sIP, $dIP, $sPort, $dPort, $bytes, $time, $comma) =
        (split /,/, $line);
    print SOURCESORTED "$time,$bytes \n";
}

close FILE;
close SOURCESORTED;

D. BYTES.PL 

# This program finds the total number of bytes in
# a 5 minute time slice referred to here as chunk.
# This code is not part of the overall main program as
# we decided after the first data sets were created to
# see if adding the average size of bytes per packet
# would increase the classification rate
# Bret Hyla September 2006

use Date::Calc qw(:all);

open(FILE, "bytes_sorted.txt");
open(BYTES, ">bytes_in_chunk.txt");

# Set a base date to do the timestamp comparison, it should
# be the first timestamp in sorted file from oldest to
# newest.
$yr="1999"; $mon="03"; $day="08"; $hr="13"; $min="00";
$sec="00";

$chunk=0;
$bytes_inchunk=0;

while (<FILE>) {
    $line=$_;
    ($date,$bytes)=(split /,/,$line);
    $bytes_inchunk=$bytes_inchunk+$bytes;

    # replace the T between date and time with a colon so that
    # one split yields all six timestamp fields
    $newtime=$date;
    $newtime=~ s/T/:/;
    ($yr2,$mon2,$day2,$hr2,$min2,$sec2) =
        (split /[\/:]/,$newtime);

    ($D_y,$D_m,$D_d,$Dh,$Dm,$Ds) =
        Delta_YMDHMS($yr,$mon,$day,$hr,$min,$sec,
                     $yr2,$mon2,$day2,$hr2,$min2,$sec2);

    # looks at the delta in the two packet time stamps and if
    # condition is met creates a new chunk
    if ($Dm>4 or $D_m>0 or $D_d>0 or $Dh>0) {
        $yr=$yr2; $mon=$mon2; $day=$day2;
        $hr=$hr2; $min=$min2; $sec=$sec2;
        $chunk++;
        print BYTES "$chunk, $bytes_inchunk \n";
        $bytes_inchunk=0;
    }
} #while

close FILE;
close BYTES;


E. TEST_TRAIN.PL 

# This program splits the first file into two files
# These files contain 80% and 20% of the original file.
# Warning, if you have already sorted the data be sure you
# have your classification groups as the primary key.

open(FILE, "Info_normalized_entropy_2ndweekv2_sorted.csv");
open(TEST, ">test.txt");
open(TRAIN, ">train.txt");

$chunk=0;

while (<FILE>) {
    $line=$_;
    $chunk++;
    if ($chunk>4) {
        # every fifth line goes to the test file
        print TEST "$line";
        $chunk=0;
    } else {
        # the other four lines in five go to the training file
        print TRAIN "$line";
    }
}

close FILE;
close TEST;
close TRAIN;

F. ENTROPY_INFORMATION.PL 

# several sections of code were based on tutorials found at
# http://perldoc.perl.org/perllol.html
# Bret Hyla September 2006 NPS

use Date::Calc qw(:all);

open(FILE, "sorted.txt");

# initialize a base date to do the timestamp comparison, it
# should be the first timestamp in sorted file from oldest
# to newest.
$yr="1999"; $mon="03"; $day="08"; $hr="13"; $min="00";
$sec="00";

$chunk=0;
$good=0;

while (<FILE>) {
    $line=$_;
    ($date,$sIP,$dIP,$sPort,$dPort,$comma)=(split /,/,$line);

    # $g counts all lines read; the *hashinfo hashes count how
    # often each value occurs across all data
    $g++;
    $sIPhashinfo{$sIP}++; $dIPhashinfo{$dIP}++;
    $sPorthashinfo{$sPort}++; $dPorthashinfo{$dPort}++;

    $newtime=$date;
    $newtime=~ s/T/:/;
    ($yr2,$mon2,$day2,$hr2,$min2,$sec2) =
        (split /[\/:]/,$newtime);

    ($D_y,$D_m,$D_d,$Dh,$Dm,$Ds) =
        Delta_YMDHMS($yr,$mon,$day,$hr,$min,$sec,
                     $yr2,$mon2,$day2,$hr2,$min2,$sec2);

    push @AoA, [ ($chunk, $sIP, $dIP, $sPort, $dPort, $good) ];

    # looks at the delta in the two packet time stamps and if
    # condition is met creates new chunk
    if ($Dm>4 or $D_m>0 or $D_d>0 or $Dh>0) {
        $yr=$yr2; $mon=$mon2; $day=$day2;
        $hr=$hr2; $min=$min2; $sec=$sec2;
        $chunk++;
    }
} #while

close FILE;

# function to find unique source ips and their prob across
# all data
foreach $keyinfo (keys %sIPhashinfo) {
    $p++;
    $valueinfo=$sIPhashinfo{$keyinfo};
    $probinfo{$keyinfo}=$valueinfo/$g;
}
print "num source unique ip keys $p \n";

# function to find unique dest ips and their prob across
# all data
foreach $key2info (keys %dIPhashinfo) {
    $q++;
    $value2info=$dIPhashinfo{$key2info};
    $prob2info{$key2info}=$value2info/$g;
}
print "num dest ip unique keys $q \n";

# function to find unique source ports and their prob
# across all data
foreach $key3info (keys %sPorthashinfo) {
    $r++;
    $value3info=$sPorthashinfo{$key3info};
    $prob3info{$key3info}=$value3info/$g;
}
print "num source port unique keys $r \n";

# function to find unique dest ports and their prob across
# all data
foreach $key4info (keys %dPorthashinfo) {
    $s++;
    $value4info=$dPorthashinfo{$key4info};
    $prob4info{$key4info}=$value4info/$g;
}
print "num dest port unique keys $s \n";
print "total lines read is $g";

open(INFO, ">info.txt");
open(ENTROPY, ">entropy.txt");

$prior=0;

for $i (0..$#AoA) {

    # a change in chunk number means the previous time slice is
    # complete, so compute and write out its totals
    if ($prior != $AoA[$i][0]) {

        # source IP: entropy within the slice, information from
        # the probabilities across all data
        foreach $key (keys %sIPhash) {
            $value=$sIPhash{$key};
            $prob=$value/$t;
            $entropyeach=-1*($prob*log($prob));
            $infoeach=-1*log($probinfo{$key});
            $totalinfo=$totalinfo+$infoeach;
            $totalentropy=$entropyeach+$totalentropy;
        }

        # destination IP
        foreach $key2 (keys %dIPhash) {
            $value2=$dIPhash{$key2};
            $prob2=$value2/$t;
            $entropyeach2=-1*($prob2*log($prob2));
            $infoeach2=-1*log($prob2info{$key2});
            $totalinfo2=$totalinfo2+$infoeach2;
            $totalentropy2=$entropyeach2+$totalentropy2;
        }

        # source port
        foreach $key3 (keys %sPorthash) {
            $value3=$sPorthash{$key3};
            $prob3=$value3/$t;
            $entropyeach3=-1*($prob3*log($prob3));
            $infoeach3=-1*log($prob3info{$key3});
            $totalinfo3=$totalinfo3+$infoeach3;
            $totalentropy3=$entropyeach3+$totalentropy3;
        }

        # destination port
        foreach $key4 (keys %dPorthash) {
            $value4=$dPorthash{$key4};
            $prob4=$value4/$t;
            $entropyeach4=-1*($prob4*log($prob4));
            $infoeach4=-1*log($prob4info{$key4});
            $totalinfo4=$totalinfo4+$infoeach4;
            $totalentropy4=$entropyeach4+$totalentropy4;
        }

        print ENTROPY "$prior,$t, $totalentropy,$totalentropy2,",
            "$totalentropy3,$totalentropy4\n";
        print INFO "$prior,$t, $totalinfo,$totalinfo2,$totalinfo3,",
            "$totalinfo4\n";

        $prior=$AoA[$i][0];

        # clearing all variables for the next time slice
        $t=0;
        undef $key; undef $prob;
        undef $entropyeach; undef $infoeach;
        undef $totalinfo; undef $totalentropy;
        undef $key2; undef $prob2;
        undef $entropyeach2; undef $infoeach2;
        undef $totalinfo2; undef $totalentropy2;
        undef $key3; undef $prob3;
        undef $entropyeach3; undef $infoeach3;
        undef $totalinfo3; undef $totalentropy3;
        undef $key4; undef $prob4;
        undef $entropyeach4; undef $infoeach4;
        undef $totalinfo4; undef $totalentropy4;
        undef %sIPhash; undef %dIPhash;
        undef %sPorthash; undef %dPorthash;

    } # if ($prior != $AoA[$i][0])

    # count this packet and its feature values in the current slice
    $t++;
    $srIP=$AoA[$i][1];
    $sIPhash{$srIP}++;
    $dtIP=$AoA[$i][2];
    $dIPhash{$dtIP}++;
    $srPort=$AoA[$i][3];
    $sPorthash{$srPort}++;
    $dtPort=$AoA[$i][4];
    $dPorthash{$dtPort}++;

} # for $i (0..$#AoA) brace

close ENTROPY;
close INFO;